Abstract

It is a challenging task to select correlated variables in a high dimensional
space. To address this challenge, the elastic net has been developed and
successfully applied to many applications. Despite its great success, the
elastic net does not explicitly use correlation information embedded in data to
select correlated variables. To overcome this limitation, we present a novel
Bayesian hybrid model, the EigenNet, that uses the eigenstructures of data to
guide variable selection. Specifically, it integrates a sparse conditional
classification model with a generative model capturing variable correlations in
a principled Bayesian framework. We reparameterize the hybrid model in the
eigenspace to avoid overfitting and to increase the computational efficiency of
its MCMC sampler. Furthermore, we provide an alternative view of the EigenNet
from a regularization perspective: the EigenNet has an adaptive
eigenspace-based composite regularizer, which naturally generalizes the
$\ell_{1/2}$ regularizer used by the elastic net. Experiments on synthetic and
real data show that the EigenNet significantly outperforms the lasso, the
elastic net, and the Bayesian lasso in terms of prediction accuracy, especially
when the number of training samples is smaller than the number of variables.



1 Introduction 



In this paper we consider the problem of selecting correlated variables in a
high dimensional space. Among many variable selection methods, the lasso and
the elastic net are two popular choices (Tibshirani 1994; Zou & Hastie 2005).
The lasso uses an $\ell_1$ regularizer on model parameters. This regularizer
shrinks the parameters towards zero, removing irrelevant variables and yielding
a sparse model (Tibshirani 1994). However, the $\ell_1$ penalty may lead to
over-sparsification: given many correlated variables, the lasso often selects
only a few of them. This not only degrades its prediction accuracy but also
affects the interpretability of the estimated model. For example, based on
high-throughput biological data such as gene expression and RNA-seq data, it is
highly desirable to select multiple correlated genes specific to a phenotype
since they may reveal underlying biological pathways. Due to its
over-sparsification, the lasso may not be suitable for this task.

To address this issue, the elastic net has been developed to encourage a
grouping effect, where strongly correlated variables tend to be in or out of
the model together (Zou & Hastie 2005). However, the grouping effect is just
the result of its composite $\ell_1$ and $\ell_2$ regularizer; the elastic net
does not explicitly incorporate correlation information among variables in its
model.

In this paper, we propose a new sparse Bayesian hybrid model, called the
EigenNet. Unlike the previous sparse models, it uses the eigen information from
the data covariance matrix to guide the selection of correlated variables.
Specifically, it integrates a sparse conditional classification model with a
generative model capturing variable correlations in a principled Bayesian
framework (Lasserre et al. 2006). The hybrid model enables identification of
groups of correlated variables guided by the eigenstructures. Also, it passes
the information from the conditional model to the generative model, selecting
informative eigenvectors for the classification task. Unlike frequentist
approaches, the Bayesian hybrid model can reveal correlations between
classifier weights via their joint posterior distribution.

We reparameterize the model in the eigenspace of the data. When the number of
predictor variables (i.e., input features), $p$, is bigger than the number of
training samples, $n$, this reparameterization restricts the model to the data
subspace, which not only reduces overfitting but also allows us to develop an
efficient Markov Chain Monte Carlo sampler.

From the regularization perspective, the EigenNet naturally generalizes the
elastic net by using a composite regularizer adaptive to the data
eigenstructures. It contains an $\ell_1$ sparsity regularizer and a directional
regularizer that encourages selecting variables associated with eigenvectors
chosen by the model. When the variables are independent of each other, the
eigenvectors are parallel to the axes and this composite regularizer reduces to
the $\ell_{1/2}$ regularizer used by the elastic net; when some of the input
variables are strongly correlated, the regularizer encourages the classifier to
align with eigenvectors selected by the model. On one hand, our model, like the
elastic net, retains 'all the big fish'. On the other hand, our model differs
from the elastic net by using the eigenstructure. Hence the name EigenNet.

Experiments on synthetic and real data are presented in Section 6. They
demonstrate that the EigenNet significantly outperforms the lasso, the elastic
net, and the Bayesian lasso (Park et al. 2008; Hans 2009) in terms of
prediction accuracy, especially when the number of training samples is smaller
than the number of features.


2 Background: lasso and elastic net



We denote $n$ independent and identically distributed samples as
$\mathcal{D} = \{(\mathbf{x}_1, y_1), \ldots, (\mathbf{x}_n, y_n)\}$, where
$\mathbf{x}_i$ is a $p$-dimensional vector of input features (i.e., explanatory
variables) and $y_i$ is a scalar label (i.e., response). Also, we denote
$[\mathbf{x}_1, \ldots, \mathbf{x}_n]$ by $\mathbf{X}$ and $(y_1, \ldots, y_n)$
by $\mathbf{y}$. In this paper, we consider the binary classification problem
($y_i \in \{-1, 1\}$), but our analysis and the proposed models can be extended
to regression and other problems.

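For classification, we use a logistic function as the data likelihood function:

$$p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, b) = \prod_{i=1}^{n} \sigma\big(y_i(\mathbf{w}^{T}\mathbf{x}_i + b)\big) \qquad (1)$$

where $\sigma(z) = \frac{1}{1 + \exp(-z)}$, and $\mathbf{w}$ and $b$ define the
classifier.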
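
As a small illustration of Eq. (1), the log-likelihood can be evaluated as in
the sketch below. It assumes NumPy and a samples-by-features layout for the
data matrix (the transpose of the column convention used above); the function
name `log_likelihood` and the demo values are hypothetical and are not taken
from the paper.

```python
import numpy as np

def log_likelihood(w, b, X, y):
    """Log of Eq. (1): sum_i log sigma(y_i (w^T x_i + b)).

    X has shape (n, p) with one sample per row (an assumed layout),
    and y has entries in {-1, +1}.
    """
    margins = y * (X @ w + b)
    # log sigma(z) = -log(1 + exp(-z)), evaluated stably via logaddexp
    return -np.sum(np.logaddexp(0.0, -margins))

# Tiny illustrative data (hypothetical values)
X_demo = np.array([[1.0, 0.0], [-1.0, 0.5], [2.0, -1.0]])
y_demo = np.array([1.0, -1.0, 1.0])

# With w = 0 and b = 0 every sample has probability 1/2
ll_zero = log_likelihood(np.zeros(2), 0.0, X_demo, y_demo)
# A classifier that separates the demo data attains a higher log-likelihood
ll_good = log_likelihood(np.array([2.0, 0.0]), 0.0, X_demo, y_demo)
```

Working with the logarithm avoids underflow when the product in Eq. (1) runs
over many samples.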



To identify relevant variables for high dimensional problems, the lasso
(Tibshirani 1994) uses an $\ell_1$ penalty, effectively shrinking $\mathbf{w}$
and $b$ towards zero and pruning irrelevant variables. In a probabilistic
framework this penalty corresponds to a Laplace prior distribution:

$$p(\mathbf{w}) = \prod_{j} \lambda \exp(-\lambda |w_j|) \qquad (2)$$

where $\lambda$ is a hyperparameter that controls the sparsity of the estimated
model. The larger the hyperparameter $\lambda$, the sparser the model.

As described in Section 1, the lasso may over-penalize relevant variables and
hurt its predictive performance, especially when there are strongly correlated
variables. To address this issue, the elastic net (Zou & Hastie 2005) combines
$\ell_1$ and $\ell_2$ regularizers to avoid the over-penalization. The combined
regularizer corresponds to the following prior distribution:

$$p(\mathbf{w}) \propto \prod_{j} \exp(-\lambda_1 |w_j| - \lambda_2 w_j^2) \qquad (3)$$

where $\lambda_1$ and $\lambda_2$ are hyperparameters. While it is well known
that the elastic net tends to select strongly correlated variables together, it
does not use correlation information embedded in the data. The selection of
correlated variables is merely the result of a less aggressive regularizer for
sparsity.
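
The correspondence between the priors (2) and (3) and their penalties can be
sketched as follows: up to an additive constant, the negative log of each prior
is the familiar regularizer. The function names and the demo vector are
hypothetical, and NumPy is assumed.

```python
import numpy as np

def lasso_penalty(w, lam):
    # Negative log of the Laplace prior in Eq. (2), dropping constants
    return lam * np.sum(np.abs(w))

def elastic_net_penalty(w, lam1, lam2):
    # Negative log of the prior in Eq. (3), dropping constants
    return lam1 * np.sum(np.abs(w)) + lam2 * np.sum(w ** 2)

w_demo = np.array([0.5, -2.0, 0.0])   # hypothetical weights

pen_lasso = lasso_penalty(w_demo, lam=1.0)
pen_enet = elastic_net_penalty(w_demo, lam1=1.0, lam2=0.5)
```

Setting `lam2 = 0` recovers the lasso penalty, which is one way to see the
elastic net as a less aggressive sparsity regularizer rather than a model of
correlation.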

Besides the elastic net, there are many variants of (and extensions to) the
lasso, such as the bridge (Frank & Friedman 1993) and smoothly clipped absolute
deviation (Fan & Li 2001). These variants modify the $\ell_1$ penalty to choose
variables, but again do not explicitly use correlation information in data.


3 EigenNet: eigenstructure-guided variable selection



In this section, we propose to use covariance structures in data to guide the 
sparse estimation of model parameters. 

First, let us consider the following toy examples. 



3.1 Toy examples 

Figure 1(a) shows samples from two classes. Clearly the variables $x_1$ and
$x_2$ are not correlated. The lasso or the elastic net can successfully select
the relevant variable $x_1$ to classify the data. For the samples in Figure
1(b), the variables $x_1$ and $x_2$ are strongly correlated. Despite the strong
correlation, the lasso would select only $x_1$ and ignore $x_2$. The elastic
net may select both $x_1$ and $x_2$ if the regularization weight $\lambda_1$ is
small and $\lambda_2$ is big, so that the elastic net behaves like an
$\ell_2$-regularized classifier. The elastic net, however, does not exploit the
fact that $x_1$ and $x_2$ are correlated.




(a) Independent variables (b) Correlated variables

Figure 1: Toy examples. (a) When the variables $x_1$ and $x_2$ are independent
of each other, both the lasso and the EigenNet select only $x_1$. (b) When the
variables $x_1$ and $x_2$ are correlated, the lasso selects only one variable.
By contrast, guided by the major eigenvector of the data, the EigenNet selects
both variables.



Since the eigenstructure of the data covariance matrix captures correlation 
information between variables, we propose to not only regularize the classifier 
to be sparse, but also encourage it to be aligned with certain eigenvector(s) 
that are helpful for the classification task. Since our new model uses the eigen 
information, we name it the EigenNet. 

For the data in Figure 1(a), since the two eigenvectors are parallel with the
horizontal and vertical axes, the EigenNet essentially reduces to the elastic
net and selects $x_1$. For the data in Figure 1(b), however, the eigenvectors
(in particular, the principal eigenvector) will guide the EigenNet to select
both $x_1$ and $x_2$.

We use a Bayesian framework to materialize the above ideas in the EigenNet, as
shown in the following section.
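
The geometric intuition behind the toy examples can be checked numerically.
The sketch below does not reproduce the data of Figure 1; it draws two
hypothetical Gaussian datasets (the variances, the correlation of 0.9, and the
sample size are assumptions) and inspects the principal eigenvector of each
sample covariance matrix.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 5000

# (a) Independent variables with different variances (assumed values)
X_ind = rng.normal(size=(n, 2)) * np.array([2.0, 0.5])

# (b) Strongly correlated variables (assumed correlation 0.9)
cov = np.array([[1.0, 0.9], [0.9, 1.0]])
X_cor = rng.multivariate_normal(np.zeros(2), cov, size=n)

def principal_eigenvector(X):
    """Eigenvector of the sample covariance with the largest eigenvalue."""
    evals, evecs = np.linalg.eigh(np.cov(X, rowvar=False))
    return evecs[:, np.argmax(evals)]

v_ind = principal_eigenvector(X_ind)   # nearly parallel to the first axis
v_cor = principal_eigenvector(X_cor)   # loads about equally on both variables
```

In case (a) the eigenvector singles out one coordinate, so alignment with it
adds nothing beyond axis-wise sparsity; in case (b) it spreads its weight over
both variables, which is the information the EigenNet uses to select them
jointly.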



3.2 Bayesian hybrid of conditional and generative models 

The EigenNet is a hybrid of conditional and generative models. The conditional 
component allows us to learn the classifier via "discriminative" training; the 
generative component captures the correlations between variables; and these two 
models are glued together via a joint prior distribution, so that the correlation 
information is used to guide the estimation of the classifier and the classification 
task is used to choose or scale relevant eigenvectors. Our approach is based on 
the general Bayesian framework proposed by Lasserre et al. (2006), which allows
one to combine conditional and generative models in an elegant principled way.

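Specifically, for the conditional model we have the same likelihood as (1),
$p(\mathbf{y} \mid \mathbf{X}, \mathbf{w}, b) = \prod_i \sigma\big(y_i(\mathbf{w}^{T}\mathbf{x}_i + b)\big)$.
To sparsify the classifier, we can use a Laplace prior on $\mathbf{w}$,

$$p(\mathbf{w}) = \prod_{j} \lambda_1 \exp\{-\lambda_1 |w_j|\}. \qquad (4)$$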



To encourage the classifier to align with certain eigenvectors, we use the
following generative model:

$$p(\mathbf{V}, \mathbf{s} \mid \tilde{\mathbf{w}}) \propto \exp\Big(-\frac{\lambda_2}{2} \sum_{j} \eta_j \|\tilde{\mathbf{w}} - s_j\mathbf{v}_j\|^2\Big) \qquad (5)$$

where

$$\|\tilde{\mathbf{w}} - s_j\mathbf{v}_j\|^2 = \|\tilde{\mathbf{w}}\|^2 - 2 s_j |\tilde{\mathbf{w}}^{T}\mathbf{v}_j| + s_j^2, \qquad (6)$$

$s_j$ are nonnegative continuous variables, and $\mathbf{v}_j$ and $\eta_j$ are
the $j$-th eigenvector and eigenvalue of the data covariance matrix,
respectively. The reason we use the absolute value of
$\tilde{\mathbf{w}}^{T}\mathbf{v}_j$ in (6) is that we only care about the
alignment of $\tilde{\mathbf{w}}$ and $\mathbf{v}_j$, not the sign of their
product. Overall, the above model encourages the classifier to be more aligned
with the major eigenvectors, which have bigger eigenvalues. But the variables
$\mathbf{s}$ allow us to scale or select individual eigenvectors to remove
irrelevant ones.

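To integrate the conditional and generative models, we use a joint prior on
$\mathbf{w}$ and $\tilde{\mathbf{w}}$:

$$p(\mathbf{w}, \tilde{\mathbf{w}}) \propto \exp(-\lambda_1 |\mathbf{w}|_1) \exp\Big(-\frac{\lambda_3}{2} \|\mathbf{w} - \tilde{\mathbf{w}}\|^2\Big), \qquad (7)$$

i.e., we have

$$p(\mathbf{w}, \tilde{\mathbf{w}}) \propto \lambda_1 \exp(-\lambda_1 |\mathbf{w}|_1)\, \mathcal{N}(\tilde{\mathbf{w}} \mid \mathbf{w}, \lambda_3^{-1}\mathbf{I}). \qquad (8)$$

Finally we can assign Gamma priors on all the hyperparameters, $\lambda_1$,
$\lambda_2$, and $\lambda_3$. The whole model is depicted in the graphical model
in Figure 2.

Figure 2: The graphical model of the EigenNet.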

3.3 Reparameterization and constraint in Eigenspace 

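In this section we reparameterize the model in the eigenspace:

$$\mathbf{w} = \mathbf{V}\boldsymbol{\alpha}, \qquad \tilde{\mathbf{w}} = \mathbf{V}\boldsymbol{\beta} \qquad (9)$$

where $\mathbf{V} = [\mathbf{v}_1, \ldots, \mathbf{v}_m]$ ($m = \min\{n, p\}$),
and $\boldsymbol{\alpha}$ and $\boldsymbol{\beta}$ are the projections of
$\mathbf{w}$ and $\tilde{\mathbf{w}}$ on the eigenvectors, respectively.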

The reparameterization restricts $\mathbf{w}$ to the vector space spanned by
$\{\mathbf{v}_1, \ldots, \mathbf{v}_m\}$, which is equivalent to the data space
$\mathcal{C}(\mathbf{X})$ spanned by the data points
$\{\mathbf{x}_1, \ldots, \mathbf{x}_n\}$. When the number of features is bigger
than the number of training points, i.e., $p > n$, it effectively reduces the
number of free parameters in the model, helping avoid overfitting. Furthermore,
it provides a significant computational advantage when $p \gg n$.
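
A minimal sketch of this reparameterization for $p > n$ is given below. It
assumes NumPy, a samples-by-features data matrix, and, for simplicity, the
uncentered second-moment matrix in place of the covariance matrix so that the
eigenvectors span exactly the space of the data points; the sizes and variable
names are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(1)
n, p = 10, 50                        # fewer samples than features
X = rng.normal(size=(n, p))          # rows are samples (assumed layout)

# Thin SVD: the rows of Vt are the eigenvectors of X^T X with nonzero
# eigenvalues, obtained without ever forming the p x p matrix.
_, sing, Vt = np.linalg.svd(X, full_matrices=False)
V = Vt.T                             # shape (p, m) with m = min(n, p)
eta = sing ** 2 / n                  # eigenvalues of X^T X / n
m = V.shape[1]

# A classifier is described by m coefficients instead of p weights
alpha = rng.normal(size=m)
w = V @ alpha

# Projecting the data once gives the inputs needed by the likelihood
Z = X @ V                            # shape (n, m)
scores_full = X @ w
scores_eigen = Z @ alpha
```

The two score vectors agree, so the likelihood can be evaluated entirely in
the $m$-dimensional eigenspace.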

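Given $p(\mathbf{w}, \tilde{\mathbf{w}})$ and the relationship between
$(\mathbf{w}, \tilde{\mathbf{w}})$ and
$(\boldsymbol{\alpha}, \boldsymbol{\beta})$, we obtain
$p(\boldsymbol{\alpha}, \boldsymbol{\beta})$ (please see the Appendix for the
details):

$$p(\boldsymbol{\alpha}, \boldsymbol{\beta}) \propto \exp(-\lambda_1 |\mathbf{V}\boldsymbol{\alpha}|_1) \exp\Big(-\frac{\lambda_3}{2} \|\boldsymbol{\alpha} - \boldsymbol{\beta}\|^2\Big) \qquad (10)$$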

Based on the new reparameterization, the likelihood for the conditional model
becomes

$$p(\mathbf{y} \mid \mathbf{X}, \boldsymbol{\alpha}, b) = \prod_{i} \sigma\big(y_i(\mathbf{x}_i^{T}\mathbf{V}\boldsymbol{\alpha} + b)\big). \qquad (11)$$

(a) (b)

Figure 3: Adaptive regularization of the EigenNet. The ellipses are the
contours of a likelihood function. While the lasso draws the estimates towards
the $\ell_1$ ball, the EigenNet's estimate is guided by an eigenvector
$\mathbf{v}$.



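Similarly, the likelihood for the generative model becomes

$$p(\mathbf{V}, \mathbf{s} \mid \boldsymbol{\beta}) \propto \exp\Big(-\frac{\lambda_2}{2} \sum_{j} \eta_j \big(\|\mathbf{V}\boldsymbol{\beta}\|^2 - 2 s_j |(\mathbf{V}\boldsymbol{\beta})^{T}\mathbf{v}_j| + s_j^2\big)\Big)$$

$$= \exp\Big(-\frac{\lambda_2}{2} \sum_{j} \eta_j \big(\|\boldsymbol{\beta}\|^2 - 2 s_j |\beta_j| + s_j^2\big)\Big). \qquad (12)$$

The second equation holds since $\mathbf{V}$ is an orthonormal matrix.

Combining (10), (11), and (12), we obtain a complete model. We use Markov Chain
Monte Carlo with a random walk proposal to estimate the model parameters
$\mathbf{s}$, $\mathbf{w}$, and $\tilde{\mathbf{w}}$.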
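
To make the combination of (10), (11), and (12) concrete, the following is a
minimal random-walk Metropolis sketch over the eigenspace parameters. It is an
illustration under stated assumptions, not the sampler used in the
experiments: the step size, chain length, fixed hyperparameter values, toy
data, joint update of all parameters, and every function name are
hypothetical, and the uncentered second-moment matrix stands in for the data
covariance matrix.

```python
import numpy as np

def log_posterior(alpha, beta, s, b, Z, y, V, eta, lam1, lam2, lam3):
    """Unnormalized log posterior assembled from Eqs. (10)-(12).

    Z = X V holds the data projected onto the eigenvectors, shape (n, m).
    """
    if np.any(s < 0):                      # s is constrained to be nonnegative
        return -np.inf
    loglik = -np.sum(np.logaddexp(0.0, -y * (Z @ alpha + b)))        # Eq. (11)
    logprior = (-lam1 * np.sum(np.abs(V @ alpha))
                - 0.5 * lam3 * np.sum((alpha - beta) ** 2))          # Eq. (10)
    loggen = -0.5 * lam2 * np.sum(
        eta * (beta @ beta - 2.0 * s * np.abs(beta) + s ** 2))       # Eq. (12)
    return loglik + logprior + loggen

def random_walk_metropolis(logp, theta0, n_iter, step, rng):
    """Metropolis sampler with an isotropic Gaussian random-walk proposal."""
    theta = np.array(theta0, dtype=float)
    current = logp(theta)
    samples = np.empty((n_iter, theta.size))
    for t in range(n_iter):
        proposal = theta + step * rng.normal(size=theta.size)
        candidate = logp(proposal)
        # Accept with probability min(1, exp(candidate - current))
        if np.log(rng.uniform()) < candidate - current:
            theta, current = proposal, candidate
        samples[t] = theta
    return samples

# Toy problem (hypothetical data): the label depends mostly on feature 0
rng = np.random.default_rng(0)
n, p = 30, 4
X = rng.normal(size=(n, p))
y = np.where(X[:, 0] + 0.1 * rng.normal(size=n) > 0, 1.0, -1.0)

_, sing, Vt = np.linalg.svd(X, full_matrices=False)
V, eta = Vt.T, sing ** 2 / n
m = V.shape[1]
Z = X @ V

def logp(theta):
    # theta packs (alpha, beta, s, b) into one vector of length 3m + 1
    alpha, beta, s, b = theta[:m], theta[m:2 * m], theta[2 * m:3 * m], theta[-1]
    return log_posterior(alpha, beta, s, b, Z, y, V, eta, 1.0, 1.0, 1.0)

theta0 = np.concatenate([np.zeros(2 * m), np.ones(m), [0.0]])
samples = random_walk_metropolis(logp, theta0, n_iter=2000, step=0.1, rng=rng)

# Posterior-mean classifier in the original space, discarding the first half
w_mean = V @ samples[1000:, :m].mean(axis=0)
```

Proposals with a negative entry in $\mathbf{s}$ receive log density
$-\infty$ and are always rejected, which is one simple way to respect the
nonnegativity constraint.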



4 Alternative view: composite regularization 

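In this section, we provide an alternative view of the EigenNet by considering
the limiting case of $\lambda_3 \to \infty$. In such a case the prior
$p(\boldsymbol{\alpha}, \boldsymbol{\beta})$ becomes

$$p(\boldsymbol{\alpha}, \boldsymbol{\beta}) = p(\boldsymbol{\alpha})\, \delta(\boldsymbol{\alpha} - \boldsymbol{\beta})$$

This forces $\boldsymbol{\alpha} = \boldsymbol{\beta}$. From a regularization
perspective, this prior is equivalent to a composite regularizer:

$$\lambda_1 |\mathbf{w}|_1 + \frac{\lambda_2}{2} \sum_{j} \eta_j \|\mathbf{w} - s_j\mathbf{v}_j\|^2 \qquad (13)$$

$$= \lambda_1 |\mathbf{w}|_1 + \frac{\lambda_2}{2} \sum_{j} \eta_j \big(\|\mathbf{w}\|^2 - 2 s_j |\mathbf{w}^{T}\mathbf{v}_j| + s_j^2\big) \qquad (14)$$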

Clearly, when $s_j = 0$ for all $j$, the above regularizer reduces to the
$\ell_{1/2}$ regularizer used by the elastic net (a subtle difference is that we
also constrain $\mathbf{w}$ to the data space for our model). When $s_j \neq 0$,
the regularizer is adaptive based on the eigenvector $\mathbf{v}_j$: First, if
the elements of $\mathbf{v}_j$ all have reasonably large values, then all the
variables in $\mathbf{w}$ are very likely to be selected. This effect is
visualized in Figure 3(b). Second, if this eigenvector has only several large
elements, the corresponding variables in $\mathbf{w}$ and $\tilde{\mathbf{w}}$
are likely to be selected jointly. Unlike the $\ell_{1/2}$ regularizer, which
encourages the selection of groups of variables from all the variables, our
regularizer directly targets specific groups of variables corresponding to the
sparse eigenvector. Third, if all the variables are independent of each other,
then the eigenvectors are parallel to the axes and each of them contains only
one nonzero element. In this case $|\mathbf{w}^{T}\mathbf{v}_j|$ reduces to
$|w_j|$, an $\ell_1$ regularizer. Figure 3(a) visualizes the eigen regularizer
when variables are independent of each other.

In summary, the EigenNet can be viewed as an adaptive generalization of 
the elastic net by selecting groups of correlated variables based on eigenvectors 
of the data covariance matrix. 
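
The reduction to the elastic net can be verified numerically. The sketch below
evaluates the composite regularizer (14) and compares it, for $s_j = 0$, with
an elastic net penalty whose $\ell_2$ weight is
$\frac{\lambda_2}{2}\sum_j \eta_j$; it also shows that a nonzero $s_j$ favors a
classifier aligned with the corresponding eigenvector. All numerical values
and function names are hypothetical.

```python
import numpy as np

def eigen_regularizer(w, V, eta, s, lam1, lam2):
    """Composite regularizer of Eq. (14)."""
    align = np.abs(V.T @ w)                        # |w^T v_j| for each j
    return (lam1 * np.sum(np.abs(w))
            + 0.5 * lam2 * np.sum(eta * (w @ w - 2.0 * s * align + s ** 2)))

def elastic_net_penalty(w, lam1, lam2):
    return lam1 * np.sum(np.abs(w)) + lam2 * np.sum(w ** 2)

V = np.eye(3)                        # eigenvectors parallel to the axes
eta = np.array([3.0, 2.0, 1.0])      # hypothetical eigenvalues
w = np.array([1.0, -2.0, 0.5])

# s = 0: the regularizer equals an elastic net penalty
reg_s0 = eigen_regularizer(w, V, eta, np.zeros(3), lam1=0.7, lam2=0.4)
enet = elastic_net_penalty(w, lam1=0.7, lam2=0.5 * 0.4 * eta.sum())

# s selects the first eigenvector: alignment with it is penalized less
s = np.array([1.0, 0.0, 0.0])
reg_aligned = eigen_regularizer(np.array([1.0, 0.0, 0.0]), V, eta, s, 0.7, 0.4)
reg_other = eigen_regularizer(np.array([0.0, 0.0, 1.0]), V, eta, s, 0.7, 0.4)
```

Both test vectors in the second comparison have the same $\ell_1$ and $\ell_2$
norms, so the difference between them comes only from the directional term.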



5 Related work 



The EigenNet can be viewed as an extension of the classical eigenface
approaches (Turk & Pentland 1991; Sirovich & Kirby 1987). The eigenface approach
uses PCA coefficients of samples to train a classifier. Naturally the major
eigenvectors are often associated with large PCA coefficients and the
classifier is constrained in the data subspace when the number of features is
smaller than the number of training samples. The EigenNet essentially extends
the eigenface approach by combining generative and conditional models in a
Bayesian framework and performing sparse learning in an adaptive eigenspace
(since the model selects or scales relevant eigenvectors based on $s_j$).

There are Bayesian versions of the lasso and the elastic net. The Bayesian lasso
(Park et al. 2008) puts a hyper-prior on the regularization coefficient and
uses a Gibbs sampler to jointly sample both regression weights and the
regularization coefficient. Using a similar treatment to the Bayesian lasso,
the Bayesian elastic net (Li & Lin 2010) samples the two regularization
coefficients simultaneously, potentially avoiding the "double shrinkage"
problem described in the original elastic net paper (Zou & Hastie 2005). Like
the EigenNet, these methods are grounded in a Bayesian framework, sharing the
benefits of obtaining posterior distributions for handling estimation
uncertainty. However, the Bayesian lasso and the Bayesian elastic net are
presented to handle regression problems (though certainly they can be
generalized for classification problems) and sample in the original parameter
space, not using the eigen information embedded in data. The EigenNet, by
contrast, works in the eigenspace and uses eigen information to guide
classification.


6 Experimental results



We evaluate the new sparse Bayesian model, the EigenNet, on both synthetic and
real data and compare it with three representative state-of-the-art variable
selection methods: the lasso, the elastic net, and the Bayesian lasso modified
for classification problems. For the lasso and the elastic net we use the
Glmnet software package, which uses cyclical coordinate descent in a pathwise
fashion.¹ The original Bayesian lasso was developed for regression and uses
Gibbs sampling. For the classification tasks we consider, we change its
Gaussian regression likelihood to the logistic likelihood (1) while keeping its
Laplace prior distributions. We used Markov Chain Monte Carlo, instead of a
Gibbs sampler, to estimate the classifier for the Bayesian lasso. Bayesian
approaches are capable of estimating all the hyperparameters from data.
However, for easy and objective comparisons, we simply use cross-validation to
tune the hyperparameters, $\lambda_j$, for all methods. For the Bayesian lasso
and the EigenNet, we draw 300,000 MCMC samples and use the last 150,000 samples
to estimate the posterior mean of the classifiers, which are used for
predicting the labels of test samples. We measure the prediction performance of
all methods on test samples in terms of their average test error rate (e.g., a
0.2 error rate indicates 20% errors) and report the standard error of the error
rates (except for the following visualization example).
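
The evaluation protocol described above (posterior mean from the second half
of the chain, test error rate, and standard error over repeated runs) can be
sketched as follows. The helper names and the demo numbers are hypothetical,
NumPy is assumed, and the standard error is computed as the sample standard
deviation divided by the square root of the number of runs, which is an
assumption about the exact definition used.

```python
import numpy as np

def posterior_mean(samples, burn_in):
    """Average the MCMC samples after discarding the first `burn_in` draws."""
    return samples[burn_in:].mean(axis=0)

def error_rate(w, b, X_test, y_test):
    """Fraction of test labels in {-1, +1} that the linear classifier misses."""
    predictions = np.where(X_test @ w + b >= 0, 1.0, -1.0)
    return np.mean(predictions != y_test)

def mean_and_standard_error(error_rates):
    e = np.asarray(error_rates, dtype=float)
    return e.mean(), e.std(ddof=1) / np.sqrt(e.size)

# Hypothetical chain of 6 draws of a 2-dimensional weight vector
chain = np.arange(12.0).reshape(6, 2)
w_hat = posterior_mean(chain, burn_in=3)

# Hypothetical test set on which one of five labels is misclassified
X_test = np.array([[1.0, 0.0], [-1.0, 0.0], [2.0, 1.0], [-3.0, 1.0], [0.5, 0.0]])
y_test = np.array([1.0, -1.0, -1.0, -1.0, 1.0])
err = error_rate(np.array([1.0, 0.0]), 0.0, X_test, y_test)

avg, se = mean_and_standard_error([0.2, 0.3, 0.25])
```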

6.1 Visualization of estimated classifiers 

First, we test these methods on synthetic data that contain correlated
features. We sample 40 dimensional data points, each of which contains two
groups of correlated variables. The correlation coefficient between variables
in each group is 0.81 and there are 4 variables in each group. We set the
values of the classifier weights in one group to 5 and in the other group to
-5. We also generate the bias term randomly from a standard Gaussian
distribution. We set the number of training points to 80. Figure 4 shows the
estimated classifiers and the true classifier. It is not surprising that the
elastic net identifies more features than the lasso. What is interesting is
that the EigenNet does not shrink many of the irrelevant features to exactly 0,
but it clearly identifies all the relevant ones, which dominate the irrelevant
ones. To save space, we do not show the classifier estimated by the Bayesian
lasso. Similar to the EigenNet, its classifier also contains many small but
nonzero weights. On this dataset, the test error rates of the lasso, the
elastic net, the Bayesian lasso, and the EigenNet are 0.297, 0.245, 0.251, and
0.137, respectively.
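
One way to generate data matching this description is sketched below. Several
details are not specified in the text and are assumptions here: standard
Gaussian marginals, a shared latent factor to induce the 0.81 correlation
(using $0.9^2 = 0.81$), placing the two groups in the last eight coordinates,
and sampling labels from the logistic model (1). The function name is
hypothetical.

```python
import numpy as np

def make_correlated_data(n, rng, p=40, group_size=4, rho=0.81):
    """Sample n points with two groups of correlated, relevant features."""
    X = rng.normal(size=(n, p))
    for start in (p - 2 * group_size, p - group_size):
        shared = rng.normal(size=(n, 1))            # common factor of a group
        noise = rng.normal(size=(n, group_size))
        # Unit variance and pairwise correlation rho within the group
        X[:, start:start + group_size] = (np.sqrt(rho) * shared
                                          + np.sqrt(1.0 - rho) * noise)
    w = np.zeros(p)
    w[p - 2 * group_size:p - group_size] = 5.0      # first group
    w[p - group_size:] = -5.0                       # second group
    b = rng.normal()                                # random bias term
    prob = 1.0 / (1.0 + np.exp(-(X @ w + b)))
    y = np.where(rng.uniform(size=n) < prob, 1.0, -1.0)
    return X, y, w, b

X_train, y_train, w_true, b_true = make_correlated_data(
    80, np.random.default_rng(0))
```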

An advantage of the Bayesian treatment for feature selection over frequentist
approaches is the possibility of uncovering the correlations between the
classifier weights. These correlations can be revealed by the covariance
matrices of the joint posterior distribution over the classifier weights. In
Figure 5 we visualize the quantized covariance matrices estimated by the
Bayesian lasso and the EigenNet. As shown in Figures 5(a) and 5(b), while the
Bayesian lasso suggests some correlation structures among features, they are
fairly noisy. By contrast, the EigenNet shows the two groups of correlated
features much more clearly.

(a) Lasso (b) Elastic net (c) EigenNet (d) True

Figure 4: Visualization of the lasso, the elastic net, the EigenNet and the
true classifier weights. These classifiers are estimated on 80 training samples
with 40 features. Among the 40 features, 8 of them (as well as the bias) are
relevant for the classification task. On this dataset the test error rates of
the lasso, the elastic net, the Bayesian lasso, and the EigenNet are 0.297,
0.245, 0.251, and 0.137.

¹ http://www-stat.stanford.edu/~tibs/glmnet-matlab/

6.2 Classification of synthetic data 

Now we systematically compare these methods on synthetic datasets containing
correlated features and datasets containing independent features. For the
first case, we use a procedure similar to that of the visualization example: we
sample 40 dimensional data points, each of which contains two groups of
correlated variables. The correlation coefficient between variables in each
group is 0.81 and there are 4 variables in each group. However, unlike in the
previous example, where the classifier weights are the same for the correlated
variables, we now set the weights within the same group to have the same sign
but different random values. We vary the number of training points from 10 to
80 and test all these methods. For the datasets with independent features, we
follow the same procedure except that the features are independently sampled.

We run the experiments 10 times. Figure 6 shows the error rates averaged over
10 runs. We do not plot the standard errors of the test error rates, since they
have very small values: the biggest one is less than 0.0183 for the results on
data with correlated features, and for the results on data with independent
features, the biggest one is less than 0.030. We report the numerical values of
both the averaged error rates and the standard errors in the supplemental
materials.

(a) Bayesian lasso (b) EigenNet

Figure 5: Covariance matrices of the Bayesian lasso and the EigenNet
classifiers. The covariance matrices are estimated based on the MCMC samples
for these two models. We use 80 training samples with 40 features per sample.
The covariance matrix of the EigenNet classifier correctly suggests the last
few features are correlated. In particular, it clearly identifies a group of
four correlated features.

For the datasets with independent features, the EigenNet outperforms the
alternative methods when the number of training samples is smaller than 40, the
number of features (i.e., $p > n$). Since in this case the eigenstructures of
the datasets are uninformative, we expect that the improved prediction accuracy
is the result of the subspace constraint used by the EigenNet. Once the number
of training samples is bigger than the data dimension, all these methods
perform quite similarly.

For the datasets with correlated features, the EigenNet significantly
outperforms the alternative methods consistently, not only when the number of
training samples is smaller than 40 ($p > n$) but also when it is not. We
believe this is because the EigenNet uses the valuable eigen information
revealing the feature correlations to train its classifiers. Note that although
the results of the elastic net appear to overlap with those of the lasso, for
the data with correlated features the elastic net often slightly outperforms
the lasso (please see their numerical values in the supplemental materials).

6.3 Classification of real data 

Besides the synthetic data, we also test all these methods on UCI benchmark
datasets: two high-dimensional gene expression datasets, leukaemia and colon
cancer, and a spambase dataset with relatively lower dimension but many more
training samples.

For the leukaemia dataset, the task is to distinguish acute myeloid leukaemia
(AML) from acute lymphoblastic leukaemia (ALL). The whole dataset has 47 and 25
samples of type ALL and AML, respectively, with 7129 features per sample. The
dataset was randomly split 20 times into 37 training and 35 test samples.

Figure 6: Test error rates on synthetic datasets with independent features and
with correlated features. Each training sample has 40 features, 8 of which are
relevant features. We increase the number of training samples from 10 to 80 and
use 2000 test samples each time. The results are averaged over 10 runs. For the
data with independent features, the EigenNet outperforms the alternative
methods at the beginning, when the number of training samples is smaller than
40, the number of features. With more training samples containing independent
features, all these methods perform comparably. For data with correlated
features, the EigenNet outperforms the alternative methods consistently.

For the colon cancer dataset, the task is to discriminate tumor from normal 
tissues using microarray data. The dataset has 22 normal and 40 cancer samples 
with 2000 features per sample. We randomly split the dataset into 31 training 
and 31 test samples 10 times. 

For the spambase dataset, the task is to detect spam emails, i.e., unsolicited
commercial emails. We use 57 features indicating whether a particular word 
or character was frequently occurring in the emails. We randomly split the 
dataset into 1533 training and 3066 test samples 10 times. Note that we do 
not use any kernel here and the results on this dataset are meant to examine 
how the performance of these methods compares to each other when there are 
more samples than features. Using a nonlinear basis function, e.g., a radial basis 
function, is expected to boost the predictive performance of all these methods. 

Figure 7 summarizes the average test error rates and the standard errors of
these methods on the three datasets. Again, the EigenNet significantly
outperforms the alternative methods on all three datasets. Note that for the
leukaemia and colon cancer datasets the Bayesian lasso performs much worse than
the other methods. The reason, we believe, is that these two high dimensional
datasets contain thousands of features and the Bayesian lasso directly draws
samples in such high dimensional spaces, leading to very slow mixing rates. By
contrast, the EigenNet draws samples efficiently in a much smaller eigenspace,
not only leading to faster mixing rates but also greatly reducing the computing
cost of obtaining each sample.

(a) Spambase (b) Colon (c) Leukemia

Figure 7: Test error rates on spambase, leukemia and colon cancer datasets.
The error bars represent the standard errors of the error rates. The results on
the spambase and colon cancer datasets are averaged over 10 random partitions
and the results on the leukemia dataset are averaged over 20 partitions.



7 Conclusions 

In this paper, we have presented a novel sparse Bayesian hybrid model, the
EigenNet. It integrates a sparse conditional classification model with a
generative model capturing the feature correlations. It also generalizes the
elastic net by explicitly exploiting correlations between features. Compared
with several state-of-the-art methods, the EigenNet achieves significantly
improved prediction accuracy on several benchmark datasets.

We plan to extend our hybrid model by utilizing other probabilistic generative
models, such as sparse principal component analysis and related projection
methods (Guan & Dy; Archambeau & Bach 2009) and independent component analysis
models. Compared to the classical PCA models, these models could be used to
better guide the selection of interdependent sparse features.


Appendix

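Given the linear relationship between
$(\boldsymbol{\alpha}, \boldsymbol{\beta})$ and
$(\mathbf{w}, \tilde{\mathbf{w}})$, the prior
$p(\mathbf{w}, \tilde{\mathbf{w}})$ defined in (7) is equivalent to
$p(\boldsymbol{\alpha}, \boldsymbol{\beta})$ defined in (10).

First, when $n > p$, we can easily obtain
$p(\boldsymbol{\alpha}, \boldsymbol{\beta})$ from
$p(\mathbf{w}, \tilde{\mathbf{w}})$. In this case, the number of eigenvectors
is $p$ and the Jacobian matrix is the $p \times p$ full rank matrix
$\mathbf{V}$. Furthermore, the absolute value of the determinant of
$\mathbf{V}$ is 1 since $\mathbf{V}$ is an orthonormal matrix. Therefore, with
$[\mathbf{w}, \tilde{\mathbf{w}}] = \mathbf{V}[\boldsymbol{\alpha}, \boldsymbol{\beta}]$
we have
$p(\boldsymbol{\alpha}, \boldsymbol{\beta}) = p(\mathbf{w}, \tilde{\mathbf{w}})$.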

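When $p > n$, $\mathbf{V}_{p \times n}$ is a tall matrix and therefore we
cannot compute its determinant to transform the prior distribution
$p(\boldsymbol{\alpha}, \boldsymbol{\beta})$. Now
$p(\mathbf{w}, \tilde{\mathbf{w}})$ is essentially a distribution on the data
subspace embedded in the high dimensional space $\mathbb{R}^p$. To obtain the
equivalence between these two priors, we consider the following theorem
(Petersen & Pedersen 2008):

Theorem 1 If $\mathbf{A}$ is "tall", i.e., "under-determined", then

$$p(\mathbf{x}) = \int p(\mathbf{s})\, \delta(\mathbf{x} - \mathbf{A}\mathbf{s})\, d\mathbf{s} = \begin{cases} \frac{1}{\sqrt{|\mathbf{A}^{T}\mathbf{A}|}}\, p(\mathbf{A}^{+}\mathbf{x}) & \text{if } \mathbf{x} = \mathbf{A}\mathbf{A}^{+}\mathbf{x} \\ 0 & \text{otherwise} \end{cases}$$

Using this theorem and the fact $|\mathbf{V}^{T}\mathbf{V}| = 1$, we see that
with the simple linear relationship between the variables,
$p(\boldsymbol{\alpha}, \boldsymbol{\beta}) = p(\mathbf{w}, \tilde{\mathbf{w}})$.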
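
The linear-algebra facts used in this argument can be checked on a small
example. The sketch below builds a hypothetical tall matrix with orthonormal
columns (the sizes and the random construction via a QR factorization are
assumptions) and verifies that its pseudo-inverse is its transpose, that
$\mathbf{V}^{T}\mathbf{V}$ has unit determinant, and that
$\mathbf{V}\mathbf{V}^{+}$ leaves points of the subspace unchanged.

```python
import numpy as np

rng = np.random.default_rng(0)
p, n = 8, 3

# Tall p x n matrix with orthonormal columns (reduced QR factorization)
V, _ = np.linalg.qr(rng.normal(size=(p, n)))

V_pinv = np.linalg.pinv(V)          # Moore-Penrose pseudo-inverse V^+
gram_det = np.linalg.det(V.T @ V)   # normalizing factor in the theorem
P = V @ V_pinv                      # projector onto the span of the columns

# A point in the subspace satisfies the condition x = V V^+ x
alpha = rng.normal(size=n)
x = V @ alpha
```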